Supervised Learning and Visualization
If there is anything important, contact me!
The on-location lectures will not be recorded.
If you feel that you are stuck, ask your classmates, ask me, ask the other lecturers. Ask a lot! Ask questions during/after the lectures and in the Q&A sessions.
If you expect that you are going to miss some part(s) of the course, please notify me via a private MS-Teams message or e-mail.
You can find all materials at the following location:
All course materials should be submitted through a pull request from your fork of
The structure of your submissions should follow the corresponding repo's README. To make it simple, I have added an example for the first practical. If you are unfamiliar with GitHub, forking, and/or pull requests, please study this exercise from another course. There you can find video walkthroughs that detail the process.
All three have a PhD in statistics and a ton of experience in development, data analysis and visualization.
| Week # | Focus | Teacher | Materials |
|---|---|---|---|
| 1 | Data wrangling with R | DG | R4DS, ISLR |
| 2 | The grammar of graphics | DG | R4DS |
| 3 | Exploratory data analysis | DG | R4DS, FIMD |
| 4 | Statistical learning: regression | MC | ISLR, TBD |
| 5 | Statistical learning: classification | EJvK | ISLR, TBD |
| 6 | Classification model evaluation | EJvK | ISLR, TBD |
| 7 | Nonlinear models | MC | ISLR, TBD |
| 8 | Bagging, boosting, random forest and support vector machines | MC | ISLR, TBD |
Each week we have the following:
Twice we have:
Once we have:
We will make groups on Wednesday Sept 13!
We begin this course series with a bit of statistical inference.
Statistical inference is the process of drawing conclusions from truths
Truths are boring, but they are convenient.
Q3: Would that mean that if we simply observe every potential unit, we would be unbiased about the truth?
The problem is a bit larger
We have three entities at play, here:
The more features we use, the more we capture about the outcome for the cases in the data
The more cases we have, the more we approach the true information
All these things are related to uncertainty. Our model can still yield biased results when fitted to \(\infty\) features. Our inference can still be wrong when obtained on \(\infty\) cases.
Core assumption: all observations are bona fide
When we do not have all information …
In some cases we estimate that we are only a bit wrong. In other cases we estimate that we could be very wrong. This is the purpose of testing.
The uncertainty measures about our estimates can be used to create intervals
Confidence intervals can be hugely informative!
If we draw 100 samples from a population, then the 95% CIs will, on average, cover the population value in 95 of those 100 samples.
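This coverage interpretation can be verified with a small simulation in R. The sample size, seed, and normal population below are illustrative assumptions, not part of the course materials:

```r
# Sketch: simulating the coverage of 95% confidence intervals
# (population, sample size and seed are arbitrary choices)
set.seed(123)
covered <- replicate(100, {
  x <- rnorm(25, mean = 0, sd = 1)  # sample of n = 25 from a population with mean 0
  ci <- t.test(x)$conf.int          # 95% confidence interval for the mean
  ci[1] < 0 & 0 < ci[2]             # does this interval cover the true mean?
})
mean(covered)                       # close to 0.95, as the interpretation suggests
```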
Prediction intervals can also be hugely informative!
Prediction intervals are generally wider than confidence intervals
Narrower intervals mean less uncertainty. It does not mean less bias!
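As a minimal sketch of both interval types in R, using the built-in `mtcars` data (the model `mpg ~ wt` and the new case are arbitrary illustrations):

```r
# Sketch: confidence vs. prediction intervals for a linear model
# (mtcars is a built-in R dataset; the model is only an illustration)
fit <- lm(mpg ~ wt, data = mtcars)
new <- data.frame(wt = 3)

conf_int <- predict(fit, newdata = new, interval = "confidence") # uncertainty about the mean mpg at wt = 3
pred_int <- predict(fit, newdata = new, interval = "prediction") # uncertainty about a single new car

# The prediction interval adds the residual variance, so it is wider
pred_int[, "upr"] - pred_int[, "lwr"] > conf_int[, "upr"] - conf_int[, "lwr"]  # TRUE
```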
Whenever I evaluate something, I tend to look at three things:
In any specific modeling effort, these components play a role in the bias/variance tradeoff as a function of model complexity.
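One side of this tradeoff can be sketched in R: with nested models, extra flexibility always reduces the error on the training data, even when the fit to new data worsens. The simulated data and polynomial degrees below are illustrative assumptions:

```r
# Sketch: training error never increases with model complexity,
# which is one side of the bias/variance tradeoff
set.seed(1)
x <- runif(50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.3)  # simulated, nonlinear truth

rss <- sapply(1:10, function(d) {
  fit <- lm(y ~ poly(x, d))  # polynomial of degree d
  sum(resid(fit)^2)          # residual sum of squares on the training data
})
all(diff(rss) <= 1e-8)       # TRUE: more flexible nested models fit the training data at least as well
```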
We now have a new problem:
There are two sources of uncertainty that we need to cover:
This is more challenging if the sample does not come randomly from the population, or if the feature set is too limited to capture the substantive model of interest.
We don’t. In practice we may often lack the necessary comparative truths!
For example:
| | Exploratory | Confirmatory |
|---|---|---|
| Description | EDA; unsupervised learning | Correlation analysis |
| Prediction | Supervised learning | Theoretical modeling |
| Explanation | Visual mining | Causal inference |
| Prescription | Personalised medicine | A/B testing |
Exploratory Data Analysis:
Describing interesting patterns: use graphs, summaries, to understand subgroups, detect anomalies, understand the data
Examples: boxplot, five-number summary, histograms, missing data plots, …
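For example, in R (using the built-in `iris` data as an arbitrary illustration):

```r
# Sketch: quick exploratory summaries of a single variable
# (iris is a built-in R dataset, chosen only as an illustration)
fivenum(iris$Sepal.Length)   # Tukey's five-number summary: min, hinges, median, max
summary(iris$Sepal.Length)   # min, quartiles, mean, max

# A boxplot per subgroup helps to compare distributions and spot anomalies:
# boxplot(Sepal.Length ~ Species, data = iris)
```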
Supervised learning:
Regression: predict continuous labels from other values.
Examples: linear regression, generalized additive model, regression trees,…
Classification: predict discrete labels from other values.
Examples: logistic regression, support vector machines, classification trees, …
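A minimal sketch of both tasks in R, using the built-in `mtcars` data (the chosen predictors are arbitrary illustrations, not course material):

```r
# Sketch: regression vs. classification with built-in data

# Regression: predict a continuous label (mpg) with linear regression
reg <- lm(mpg ~ wt + hp, data = mtcars)

# Classification: predict a discrete label (am: 0 = automatic, 1 = manual)
# with logistic regression
clf <- glm(am ~ wt + hp, data = mtcars, family = binomial)

head(predict(reg))                     # continuous predictions
head(predict(clf, type = "response"))  # predicted probabilities in [0, 1]
```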
How do you think that data analysis relates to:
People from different fields (such as statistics, computer science, information science, industry) have different goals and different standard approaches to data analysis. In this course we emphasize drawing insights that help us understand the data.
36 years ago, on 28 January 1986, 73 seconds into its flight and at an altitude of 9 miles, the space shuttle Challenger experienced an enormous fireball caused by one of its two booster rockets and broke up. The crew compartment continued its trajectory, reaching an altitude of 12 miles, before falling into the Atlantic. All seven crew members, consisting of five astronauts and two payload specialists, were killed.
In the decision process that led to the unfortunate launch of the Space Shuttle Challenger, some dark data existed.
Dark data is information that is not available.
Such unavailable information can mislead people. The notion that we could potentially be misled is important, because we then need to accept that our outcome analysis or decision process might be faulty.
If you do not have all information, there is always a possibility that you arrive at an invalid conclusion or a wrong decision.
When high-risk decisions are at hand, it is paramount to analyze the correct data.
When thinking about important topics, such as whether to stay in school, it helps to know that more highly educated people tend to earn more, but also that there is no difference for top earners.
Before John Snow, people thought "miasma" (some form of bad air) caused cholera and they fought it by airing out the house. It was not clear whether this helped or not, but people thought it must, because "miasma" theory said so.
Election polls vary randomly from day to day. Before aggregating services like Peilingwijzer, newspapers would make huge news items based on noise from opinion polls.
If we know the flu is coming two weeks earlier than usual, that is just enough time to buy shots for very weak people (but be aware: changing data conditions).
If we know how ecosystems are affected by temperature change, we know how our forests will change in the coming 50-100 years due to climate change.
The examples have in common that data analysis and the accompanying visualizations have yielded insights and solved problems that could not be solved without them.